A61 update: Avoid re-resolution without any connection failures by ejona86 · Pull Request #539 · grpc/proposal

ejona86 · 2026-03-06T23:34:46Z

With this new behavior, we will be guaranteed not to request re-resolution more often than every $NUM_SUBCHANNEL subchannel failures.

Some "initial pass" references were changed to "first pass" to use consistent language.

This change has limited impact due to DNS's limit of 1 query per 30 seconds. But it will prevent noop updates from causing extra DNS resolutions. This will be low priority to implement in languages, although in Java there's benefit in implementing it now to help when our "match pre-dualstack behavior" flag is enabled.
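As an illustration only (not code from this PR or from any gRPC implementation), the counting rule described above might be sketched as follows. The class `FailureCounter` and its method names are hypothetical; the sketch assumes that every subchannel failure increments a single counter and that the counter resets whenever re-resolution is requested.

```python
class FailureCounter:
    """Hypothetical helper; the names here are illustrative, not gRPC API."""

    def __init__(self, num_subchannels: int) -> None:
        if num_subchannels <= 0:
            raise ValueError("num_subchannels must be positive")
        self.num_subchannels = num_subchannels
        self.failures = 0
        self.re_resolution_requests = 0

    def on_connection_failure(self) -> bool:
        """Record one subchannel failure.

        Returns True when this failure causes a re-resolution request.
        """
        self.failures += 1
        if self.failures < self.num_subchannels:
            return False
        # Threshold reached: request re-resolution and start counting again.
        self.failures = 0
        self.re_resolution_requests += 1
        return True


counter = FailureCounter(num_subchannels=3)
results = [counter.on_connection_failure() for _ in range(7)]
print(results)
```

With three subchannels and seven failures, this sketch requests re-resolution on the third and sixth failures only, so requests are never closer together than three failures.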

A61-IPv4-IPv6-dualstack-backends.md

markdroth · 2026-03-09T19:06:28Z

A61-IPv4-IPv6-dualstack-backends.md

@@ -143,18 +147,18 @@ all times, with no regard for the order of the addresses.  Each
 individual subchannel will provide [backoff behavior][backoff-spec],
 reporting TRANSIENT_FAILURE while in backoff and then IDLE when backoff
 has finished.  The pick_first policy will therefore automatically
-request a connection whenever a subchannel reports IDLE.  We will count
-the number of connection failures, and when that number reaches the
-number of subchannels, we will request re-resolution; note that because
+request a connection whenever a subchannel reports IDLE.  When the
+number of connection failures reaches the number of subchannels (and not
+in the first pass), we will request re-resolution; note that because
 the backoff state will differ across the subchannels, this may mean that
 we have seen multiple failures of a single subchannel and no failures
 from another subchannel, but this is a close enough approximation and
 very simple to implement.


I think it might be clearer to decouple the discussion of re-resolution behavior from the rest of the description here by putting it in its own paragraph:

Suggested change

- We will wait for at least one connection attempt on every address to
  fail before we consider the first pass to be complete.  As per
  [gRFC A62][A62], we will report TRANSIENT_FAILURE state and will
  continue trying to connect.  We will stay in TRANSIENT_FAILURE until
  either (a) we become connected or (b) the LB policy is destroyed by the
  channel shutting down or going IDLE.

If the first pass completes without a successful connection attempt, we
will switch to a mode where we keep trying to connect to all addresses at
all times, with no regard for the order of the addresses.  Each
individual subchannel will provide [backoff behavior][backoff-spec],
reporting TRANSIENT_FAILURE while in backoff and then IDLE when backoff
has finished.  The pick_first policy will therefore automatically request a
connection whenever a subchannel reports IDLE.

Each time a subchannel reports TRANSIENT_FAILURE, we will increase a
counter for the number of connection failures. The counter is reset any time
re-resolution is requested.  When the number of connection failures reaches
the number of subchannels (and not in the first pass), we will request
re-resolution; note that because the backoff state will differ across the
subchannels, this may mean that we have seen multiple failures of a single
subchannel and no failures from another subchannel, but this is a close
enough approximation and very simple to implement.

ejona86 · 2026-03-09 (edited)

Your use of the GitHub "suggestion" feature is broken. You might want to avoid using it next time or properly select the multiple lines necessary to make it useful. (Or... I don't even know. I see you selected multiple lines now. I have no clue what GitHub is doing. Although the multiple lines selected still look suspect.)

That seems strictly worse to me, as it needs to alternate between "first pass" and "not first pass". It's also not as clear whether this paragraph is a continuation of the "not first pass" paragraph above; in fact, reading what you wrote I'd assume it is a continuation and only applies when we swap "modes", because the "(and not in the first pass)" is a parenthetical.

I don't understand why "note that because the backoff" is combined with a semicolon with the previous sentence. They seem unrelated.

It looks like your new reorganization broke the language, because now we're not requesting re-resolution when we leave the first pass. Was that your intention? I can't actually tell your intention with the language/organization present.

markdroth · 2026-03-09T19:08:54Z

A61-IPv4-IPv6-dualstack-backends.md

+number of connection failures reaches the number of subchannels (and not
+in the first pass), we will request re-resolution; note that because


Why not in the first pass? If we're going to trigger this based solely on the number of connection attempt failures, then shouldn't we do that regardless of whether or not we're in the first pass? It seems like if we're already in TF and we get a new address list, we shouldn't effectively reset the counter.

ejona86 · 2026-03-09

I didn't think I changed any semantics here. I just made it more explicit. This paragraph only applies when we swap "modes" away from first pass.

I don't think we need to re-resolve during the first pass as we haven't actually tried all the addresses yet. We trigger re-resolution after that completes. If we are doing a first pass, we're not guaranteed to be in TF...

I could understand basing some of this off of "in TF" vs "not in TF," but the existing design uses "in first pass" vs "not in first pass." So changes there are more invasive.
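To make the "in first pass" vs "not in first pass" distinction concrete, here is a rough sketch of one possible reading of the behavior discussed in this thread. It is not the gRFC text or any implementation: `PickFirstModel`, its fields, and the event strings are all hypothetical. It assumes that failures during the first pass only track which addresses have been tried, that re-resolution is requested once when the first pass completes, and that the failure counter applies only afterwards.

```python
class PickFirstModel:
    """Hypothetical model of the re-resolution rule; not gRPC code."""

    def __init__(self, addresses):
        self.addresses = list(addresses)
        self.in_first_pass = True
        # Addresses that have not yet failed during the first pass.
        self.not_yet_failed = set(self.addresses)
        self.failures = 0
        self.events = []

    def _request_re_resolution(self):
        # The counter resets any time re-resolution is requested.
        self.failures = 0
        self.events.append("re-resolve")

    def on_failure(self, address):
        if self.in_first_pass:
            # No counting yet: we have not tried every address.
            self.not_yet_failed.discard(address)
            if not self.not_yet_failed:
                self.in_first_pass = False
                self.events.append("first-pass-complete")
                self._request_re_resolution()
            return
        # After the first pass: count failures across all subchannels.
        self.failures += 1
        if self.failures >= len(self.addresses):
            self._request_re_resolution()


model = PickFirstModel(["a", "b"])
for addr in ["a", "b", "a", "a"]:
    model.on_failure(addr)
print(model.events)
```

In this run the second re-resolution is triggered by two failures of address "a" and none of "b", which is the approximation the proposed text accepts.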

A61 update: Avoid re-resolution without any connection failures

496695d

ejona86 requested review from dfawley, markdroth and murgatroid99 March 6, 2026 23:34

ejona86 mentioned this pull request Mar 6, 2026

core: Don't set firstPass=true on address update in pick_first grpc/grpc-java#12679

Open

ejona86 requested a review from easwars March 6, 2026 23:39

dfawley reviewed Mar 9, 2026

A61-IPv4-IPv6-dualstack-backends.md (resolved)

dfawley reviewed Mar 9, 2026

A61-IPv4-IPv6-dualstack-backends.md (outdated, resolved)

"using Happy Eyeballs"

fd718ed

dfawley approved these changes Mar 9, 2026

markdroth reviewed Mar 9, 2026

ejona86 requested a review from markdroth March 12, 2026 16:38

ejona86 wants to merge 2 commits into grpc:master from ejona86:dualstack-pf-reduce-refresh

3 participants

